Many times when reading through complex threads, research papers, or even blogs by some of the more advanced SEOs in the industry, I get lost in the meaning of terms, and an entire paragraph or document can be lost to my ignorance. Luckily, there are great resources like the Modern Information Retrieval Glossary hosted at UC Berkeley.
I’ve picked out some of the more important terms to know:
- Clustering – the grouping of documents which satisfy a set of common properties. The aim is to bring together documents that are related to one another. Clustering can be used, for instance, to expand a user query with new and related index terms.
- E measure – an information retrieval performance measure, distinct from the harmonic mean, which combines recall and precision.
- Generalized vector space model – a generalization of the classic vector model based on a less restrictive interpretation of term-to-term independence.
- Information retrieval – (IR) the part of computer science which studies the retrieval of information (not data) from a collection of written documents. The retrieved documents aim to satisfy a user information need usually expressed in natural language.
- Latent semantic indexing – an algebraic model of document retrieval based on a singular value decomposition of the vector space of index terms.
- Probabilistic model – a classic model of document retrieval based on a probabilistic interpretation of document relevance (to a given user query).
- Stemming – a technique for reducing words to their grammatical roots.
- TREC collection – a reference collection which contains over a million documents and which has been used extensively in the TREC conferences. The TREC collection has been organized by NIST and is becoming a standard for comparing IR models and algorithms.
- Zipf’s Law – an empirical rule that describes the frequency of words in a text. It states that the i-th most frequent word appears about as many times as the most frequent one divided by i^θ, for some θ ≥ 1.
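To make the E measure definition above concrete, here is a minimal sketch in Python. It assumes one common formulation, E_b = 1 − (1 + b²)·P·R / (b²·P + R), where with b = 1 the measure becomes one minus the harmonic mean (the F measure) of precision and recall; the document IDs are hypothetical and purely for illustration.

```python
# Sketch of precision, recall, and the E measure (one common formulation).
def precision_recall(retrieved, relevant):
    """Return (precision, recall): the fraction of retrieved docs that are
    relevant, and the fraction of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

def e_measure(precision, recall, b=1.0):
    # E_b = 1 - (1 + b^2) * P * R / (b^2 * P + R)
    f = (1 + b * b) * precision * recall / (b * b * precision + recall)
    return 1 - f

# Hypothetical result set: 2 of 4 retrieved docs are relevant,
# and 2 of 4 relevant docs were retrieved.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5", "d6"])
print(p, r, e_measure(p, r))  # 0.5 0.5 0.5
```

Lower E values are better, since E is one minus an averaged precision/recall score.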
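Stemming is easiest to see with a toy example. The sketch below is a deliberately naive suffix stripper; production systems use rule-based stemmers such as Porter's algorithm, which applies ordered rewrite rules with conditions on the remaining stem, so treat this only as an illustration of the idea.

```python
# Toy stemmer: strip a known suffix if enough of a stem remains.
def naive_stem(word):
    # Longer suffixes are tried first so "edly" wins over "ed".
    for suffix in ("ational", "ization", "edly", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(naive_stem("connecting"))  # connect
print(naive_stem("cats"))        # cat
```

Mapping "connecting", "connected", and "connection" toward a shared root is what lets a query on one surface form match documents using another.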
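Zipf's Law can be sketched in a few lines: given observed word counts, the predicted frequency of the i-th ranked word is the top word's frequency divided by i^θ. The tiny sentence below is made up for illustration; real corpora are needed before the fit means anything.

```python
from collections import Counter

def zipf_predicted(frequencies, theta=1.0):
    # Under Zipf's law, the i-th most frequent word occurs about
    # f1 / i**theta times, where f1 is the most frequent word's count.
    ranked = sorted(frequencies.values(), reverse=True)
    f1 = ranked[0]
    return [f1 / (i ** theta) for i in range(1, len(ranked) + 1)]

# Toy text, purely for illustration.
counts = Counter("the cat sat on the mat and the dog sat on the log".split())
print(counts.most_common(2))      # [('the', 4), ...]
print(zipf_predicted(counts)[:2]) # [4.0, 2.0]
```

In IR this skew matters: a handful of very frequent words (stopwords) dominate any index, while most terms in the vocabulary are rare.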